Preserving Geospatial Data: The National Geospatial Digital Archive’s Approach
Abstract
The National Geospatial Digital Archive (NGDA) is one of eight initial projects funded by the Library of Congress’s National Digital Information Infrastructure and Preservation Program (NDIIPP). The project’s overarching goal is to answer the question: How can we preserve geospatial data on a national scale and make it available to future generations? This paper summarizes the project’s work in four areas: analysis of the characteristics of geospatial data relevant to preservation; elucidation of the “relay” principles of long-term preservation; development of an OAIS-compliant archive system; and development of a wiki- and repository-based format registry.

Introduction

The National Geospatial Digital Archive (NGDA, http://www.ngda.org/), a partnership between the Map & Imagery Laboratory, Davidson Library, at the University of California at Santa Barbara, and Branner Earth Sciences Library at Stanford University, is one of eight initial projects funded by the Library of Congress’s National Digital Information Infrastructure and Preservation Program (NDIIPP, http://www.digitalpreservation.gov/). The project’s overarching goal is to answer the question: How can we preserve geospatial data on a national scale and make it available to future generations? Work on the project began in earnest in 2005 and immediately led to several new questions being posed:

• What are the characteristics of geospatial data that impact preservation?
• Given a desire to preserve information for a century or longer (a period of time far exceeding the lifetimes of the applications, platforms, and people involved in the information’s creation), is there any preservation architecture, or are there at least any general design principles or best practices, that can carry the information through a century of unforeseeable technological and social change?
• Given a desire to preserve information on a large scale, can we define a minimal level or minimum standard of preservation that has a high chance of being achieved over the course of a century, without interruption or discontinuity, so that the information remains (at least potentially) as useful as when it was created, despite unforeseeable fluctuations in the resources available for the information’s curation over time, and fluctuations in interest in the information and in the information’s perceived value?

This paper summarizes NGDA’s work in answering these questions. In the next section we list characteristics of geospatial data relevant to preservation. In the subsequent two sections, we elucidate three principles of long-term preservation and describe a prototype archive system built by NGDA that satisfies those principles. Finally, we describe NGDA’s work in developing a wiki- and repository-based format registry.

Geospatial data characteristics

Geospatial data refers to the wide variety of scientific and government-produced datasets that have a geographic component, and that can typically be viewed as representing a portion of the Earth’s surface in some way. This class of information encompasses remote-sensing imagery, aerial photography, maps, data produced by both fixed and mobile geographically-embedded sensors, and data created and processed by GIS (Geographic Information System) tools. The following are some characteristics of geospatial data that are relevant to its preservation.

No uniform data model. Geospatial data spans a wide variety of data organizations: vector and raster; topological and non-topological; over domains both discrete and continuous. Geospatial applications and file formats support differing subsets and aspects of these data organizations, and to varying degrees.
One attempt at defining a universal, public data model for geospatial data has been made, the USGS SDTS format (http://mcmcweb.er.usgs.gov/sdts/), but it has failed to achieve widespread adoption. As a consequence, it is not possible to speak of “geospatial data” as a single type of quantity that can be handled by multiple, functionally equivalent applications and formats.

Proprietary formats. Many geospatial formats, particularly GIS formats, are proprietary and therefore closely tied to applications. Furthermore, as is typical with formats driven by marketplace competition, they are frequently subject to backward-incompatible revisions over time.

Multiple granule sizes. In contrast to textual information, which has been successfully modeled using multi-page, (hyper)textual documents as the sole granule size, geospatial data is regularly processed at varying granule sizes. The granule sizes range from individual features having geographic location, geometry, and related attributes; to homogeneous, thematic layers of features; to integrated, heterogeneous databases. Data can be aggregated, disaggregated, and operated on with some fluidity. Each of these granularities has its uses, affords different functionality, and poses different preservation challenges. As a consequence, there is no single preservation problem for geospatial data; instead, choosing which level or granule size to address, and therefore identifying the preservation problem(s), is a first step of the process.

Relational data systems. Geospatial data managed by GIS tools is more and more often being stored in “geodatabases”: relational databases with geographic extensions. The virtue of the geodatabase, namely that it provides a unified, seamless environment in which to store complex relationships among heterogeneous features, is also a bane for preservation, as it means that it is often not possible to extract individual components out of the database without losing information.
And geodatabases inherit all the problems of preserving relational databases: the need to take snapshots of running database systems; storage of snapshots in proprietary database dump formats; complex dump formats; and large, monolithic snapshot files.

Large size. The size of geospatial data is large by any measure, with datasets commonly having gigabyte granularities and with some datasets growing by terabytes per day.

Long-lived programs. Geospatial datasets can be long-lived: satellite-based sensor programs may run for years, even decades. As a consequence, it becomes necessary to begin archiving datasets long before they are “finished.” Traditionally this has been addressed by binding datasets to storage systems that inevitably become obsolete even within the program’s lifetime, but archival systems of the future that hope to lower both the cost of preservation and the risk of information loss will need to be designed to allow easy turnover and handoff of ever-evolving components and technologies.

Extensive context. Capturing and preserving enough of the context surrounding geospatial data to support the data’s future interpretation and use can be challenging. Whereas format information by itself is sufficient to support the future renderability, and therefore usability by humans, of multimedia documents (e.g., knowledge of the PDF format is sufficient to render PDF documents), geospatial data can require much more, and more complex, contextual information. Using remote-sensing imagery in scientific modeling requires detailed knowledge of platform and sensor characteristics, and in many cases calibration and processing steps as well. Strictly speaking, such contextual information constitutes metadata, but in practice, being voluminous, it is not handled as such (for example, it is not stored in metadata records bundled with the data).

Implicit context. In many cases, the context surrounding geospatial data is implicit and embedded in small, relatively insular scientific communities.

Dynamic data. Some datasets, particularly Climate Data Records (CDRs), may need to be periodically reprocessed from source datasets in response to corrections and improvements in calibration and Earth models. Thus the context for these datasets must include not only information for their use, but information for their (re)processing as well, including software, algorithms, workflows, ancillary calibration tables, and other artifacts. And, in addition to simply storing such information, it must be possible to re-execute workflows, implying that lineage relationships between datasets and source datasets must be actively maintained. In the larger view, science datasets reside in a dynamic ecosystem of related datasets, and to preserve a dataset means to preserve the dataset’s ability to function in that ecosystem.

From these characteristics we conclude that preserving geospatial data poses several challenges beyond those already imposed by the general digital preservation problem. Whereas a multimedia document typically resides within a single file, geospatial data may reside in complex, multi-file objects. Whereas the interpretation of a PDF document may be defined by the format label “PDF,” and in turn by an entry in a central format registry, geospatial data may require extensive, product-specific context to interpret. Whereas a thesis or journal article is fixed upon publication, geospatial data can remain dynamic indefinitely due to the lifetime of the generating program and the need to be periodically reprocessed.

Relay-supporting preservation architectures

We now turn to NGDA’s work on preservation architectures.
In thinking about how information can be preserved, it is natural to focus on the system that will house the information: a system must be built to hold the information and make it accessible; the system’s purpose is (at least in part, if not wholly) to preserve the information; and hence, it is tempting to think, by building the system, the preservation of the information will have been addressed. This line of thinking is particularly attractive if the system supports preservation-related functionality such as format migration.

But if our goal is to preserve the information for a century or longer, it is evident that any system, no matter how well-designed or well-supported or preservation-supporting, is destined to become obsolete and unsupportable long before the century mark. Currently, storage systems become obsolete within a few years; storage media technologies, within a decade. At the next level up, in NGDA’s experience in running libraries and data centers, we have found it very difficult to keep any type of data management system (repository system, digital library, etc.) running for even a decade. And at the highest level, curators and institutions themselves come and go over time. Few institutions can guarantee their own existence over a century, let alone their ability to continuously preserve and curate any particular piece of information. Instead, as Chris Rusbridge of the Digital Curation Centre has observed, long-term preservation is more likely to resemble a series of short-term guarantees measured in decades or less.

Thus we argue that preservation takes the form of an extended “relay” over time [5]. Preserving digital information for a century will require a series of handoffs, occurring repeatedly at many levels: between different types of media and storage subsystems, different object frameworks and organizational schemes, different repository systems, different institutions and policy regimes, and different, diverse application communities.
The design of such an archive relay for digital information must focus on achieving the kind of interoperability that maximizes the ease with which such handoffs can successfully be made, in spite of the heterogeneity that will be introduced at many steps along the way [3].

Furthermore, the problems in making successful handoffs are likely to be exacerbated over time as archives of the future find themselves curating older and older digital information. Given our short digital history, most archives today are in the fortunate position of working with recently-created information; that is, with information types that are still current and well-understood in their respective communities. But if we consider our 100-year reference timespan, archives from the middle to the end of that span will be faced with curating information for which all links to the original creators and context have been severed. To see this, one only has to consider the challenges, in the year 2009, of curating digital materials created in 1959, or 1909.

Architectural principles

NGDA has identified three architectural design principles that extend the recommendations made by the OAIS standard [2] and that we believe are necessary to support preservation of information across long-lived chains of curators and preservation systems.

Relay principle

The “relay” principle states: a preservation system should support handoff of its archived content to the next preservation system in succession; that is, the preservation system should support its own migration. (Note that we’re distinguishing migration of the system itself here from migration of archived content within the system, e.g., file format migration.) Furthermore, the system should support its own migration at the archive, repository system, and storage system levels independently, to accommodate the different rates at which handoffs occur at these different levels and the different challenges that arise in each case.

If we take as a running, simplified example a user managing a set of photographs on a personal computer, with the photo management program (iPhoto, Picasa, etc.) playing the role of the repository system, then this principle states that the management program should support migration of the user’s photo library to another, different photo management program. (This principle is perhaps analogous to recent calls for Web 2.0 data ownership and portability principles as exemplified by the DataPortability Project.) This principle also requires that, independently, the photo management program support handoff of just the storage of the photo library, for example, from one disk or computer or storage system to another.
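
The storage-level handoff in the photo-library example can be illustrated with a minimal sketch. It is not NGDA's archive system: the function names (`export_for_handoff`, `verify_handoff`), the `manifest.json` file name, and the choice of SHA-256 checksums are all assumptions made for illustration. The idea shown is only that the outgoing store writes a self-describing manifest next to its content, so that whatever system receives the copy can check it without any knowledge of the program that produced it.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def export_for_handoff(archive_root: Path, manifest_name: str = "manifest.json") -> Path:
    """Write a manifest (hypothetical format) mapping each file's relative
    path to its checksum, so the content describes itself."""
    manifest = archive_root / manifest_name
    entries = {
        p.relative_to(archive_root).as_posix(): sha256_of(p)
        for p in sorted(archive_root.rglob("*"))
        if p.is_file() and p != manifest
    }
    manifest.write_text(json.dumps(entries, indent=2, sort_keys=True))
    return manifest


def verify_handoff(received_root: Path, manifest_name: str = "manifest.json") -> list:
    """Return the relative paths that are missing or altered in the received copy."""
    entries = json.loads((received_root / manifest_name).read_text())
    problems = []
    for rel, expected in entries.items():
        target = received_root / rel
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel)
    return problems


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        old_store = Path(tmp) / "old_store"
        (old_store / "2009").mkdir(parents=True)
        (old_store / "2009" / "photo.jpg").write_bytes(b"example bytes")

        export_for_handoff(old_store)           # outgoing system describes its content
        new_store = Path(tmp) / "new_store"
        shutil.copytree(old_store, new_store)   # the handoff itself: a plain copy
        print(verify_handoff(new_store))        # receiving side checks the copy
```

Because the manifest travels with the files and uses only relative paths, the same check works whether the copy lands on another disk, another computer, or a different storage system, which is the independence between levels that the principle asks for.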